To install a package pkgName, simply type install.packages("pkgName") into the console.
To use functions in the package, we first have to load the package: library(pkgName). If the package has not been installed yet, we will get an error. Otherwise, functions in the package are now available for use.
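The install-then-load steps above can be combined so that a package is only installed when it is missing. The sketch below is one possible way to do this; is_installed() and load_or_install() are hypothetical helper names, not functions from base R or from any package.

```r
# Hypothetical helper: TRUE if the named package can be found on this machine.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# Hypothetical helper: install the package only if it is missing, then load it.
load_or_install <- function(pkg) {
  if (!is_installed(pkg)) {
    install.packages(pkg)  # needs an internet connection
  }
  library(pkg, character.only = TRUE)
}

is_installed("stats")  # "stats" ships with R, so this should be TRUE

# Uncomment to use it with the package from this session:
# load_or_install("fueleconomy")
```

The character.only = TRUE argument is needed because the package name is passed as a string stored in a variable rather than typed directly.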
Packages not only give us access to user-created functions, but also user-created datasets. In R, datasets are usually stored as data frames.
Let’s load the fueleconomy package (if you haven’t installed this package yet, run this command first: install.packages("fueleconomy")):
library(fueleconomy)
Load the vehicles dataset with the data function (to find out more about the vehicles dataset, key in ?vehicles):
data(vehicles)
An entry vehicles pops up in the Environment tab. We can see that the dataset has ~33,000 observations with 12 variables.
Let’s view the data with the View() function (note the capital V). (Alternatively, we can click on “vehicles” in the Environment tab.) A new tab pops up in the top-left pane displaying the data. Clicking on the column names allows us to sort the data.
(Note: Some of you might not be able to click on “vehicles” in the Environment tab right away. Don’t worry about it: typing View(vehicles) into the console will still work, and you should be able to click on “vehicles” after that.)
33,000 observations is a lot to look through. Instead of looking through all of them, we can use various functions to give us a feel for the data.
Use the head and tail functions to display the first few or last few rows of the dataset. To control the number of rows shown (default is 6), use the optional n argument.
head(vehicles)
## id make model year class
## 1 27550 AM General DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 2 28426 AM General DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 3 27549 AM General FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 4 28425 AM General FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 5 1032 AM General Post Office DJ5 2WD 1985 Special Purpose Vehicle 2WD
## 6 1033 AM General Post Office DJ8 2WD 1985 Special Purpose Vehicle 2WD
## trans drive cyl displ fuel hwy cty
## 1 Automatic 3-spd 2-Wheel Drive 4 2.5 Regular 17 18
## 2 Automatic 3-spd 2-Wheel Drive 4 2.5 Regular 17 18
## 3 Automatic 3-spd 2-Wheel Drive 6 4.2 Regular 13 13
## 4 Automatic 3-spd 2-Wheel Drive 6 4.2 Regular 13 13
## 5 Automatic 3-spd Rear-Wheel Drive 4 2.5 Regular 17 16
## 6 Automatic 3-spd Rear-Wheel Drive 6 4.2 Regular 13 13
tail(vehicles, n = 2)
## id make model year class
## 33441 33306 smart fortwo electric drive coupe 2013 Two Seaters
## 33442 34394 smart fortwo electric drive coupe 2014 Two Seaters
## trans drive cyl displ fuel hwy cty
## 33441 Automatic (A1) Rear-Wheel Drive NA NA Electricity 93 122
## 33442 Automatic (A1) Rear-Wheel Drive NA NA Electricity 93 122
Under the hood, data frames are implemented as lists, with each column being one element in the list. Hence, whatever we can do with lists, we can do with data frames. For example, we can get the data frame’s column names using names():
names(vehicles)
## [1] "id" "make" "model" "year" "class" "trans" "drive" "cyl"
## [9] "displ" "fuel" "hwy" "cty"
To access a particular column, we can use the [[ or $ notation:
vehicles$class[1:10]
## [1] "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD"
## [3] "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD"
## [5] "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD"
## [7] "Midsize Cars" "Subcompact Cars"
## [9] "Subcompact Cars" "Subcompact Cars"
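The example above uses the $ notation only. As a small sketch of the [[ notation, here is a made-up three-row data frame (toy is an illustrative name, and the values are for illustration only). The [[ form is handy when the column name is stored in a variable or when we want a column by position.

```r
# A small made-up data frame, for illustration only
toy <- data.frame(
  make = c("Acura", "Eagle", "Ford"),
  hwy  = c(27L, 30L, 22L),
  stringsAsFactors = FALSE
)

toy$hwy          # column by name with $
toy[["hwy"]]     # same column with [[ and a string

col_name <- "hwy"
toy[[col_name]]  # [[ also works when the name is stored in a variable

toy[[2]]         # or by position: the second column
```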
Since the number of columns in a data frame is just the number of elements in a list, we can get the number of columns using length():
length(vehicles)
## [1] 12
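To make the "data frames are lists" point concrete, the sketch below checks it on a made-up data frame (toy is an illustrative name) and applies a list-style function, sapply(), to every column.

```r
# A small made-up data frame, for illustration only
toy <- data.frame(
  make = c("Acura", "Eagle", "Ford"),
  hwy  = c(27L, 30L, 22L),
  stringsAsFactors = FALSE
)

is.list(toy)        # TRUE: a data frame is a list of columns
length(toy)         # 2: one list element per column
sapply(toy, class)  # apply class() to each column, as we would for any list
```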
We can also use the ncol() and nrow() functions to get the number of columns and rows of the data frame:
ncol(vehicles)
## [1] 12
nrow(vehicles)
## [1] 33442
Interestingly, data frames can act a little like matrices too. For example, we can use dim() to figure out the number of rows and columns in the data frame:
dim(vehicles)
## [1] 33442 12
To access the 30th row, we can type
vehicles[30, ]
## id make model year class trans drive
## 30 16734 Acura 3.2TL 2001 Midsize Cars Automatic (S5) Front-Wheel Drive
## cyl displ fuel hwy cty
## 30 6 3.2 Premium 27 17
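The same [row, column] notation can pick out columns or single cells as well. The sketch below uses a made-up plain data frame (toy is an illustrative name). One assumption to keep in mind: vehicles is a tbl_df, and selecting a single column with [ , ] on a tbl_df may return a one-column table rather than a vector.

```r
# A small made-up data frame, for illustration only
toy <- data.frame(
  make = c("Acura", "Eagle", "Ford"),
  hwy  = c(27L, 30L, 22L),
  cty  = c(17L, 20L, 16L),
  stringsAsFactors = FALSE
)

toy[2, ]                      # second row, all columns
toy[, "hwy"]                  # all rows, one column (a vector for a plain data frame)
toy[2, "hwy"]                 # a single cell
toy[1:2, c("make", "hwy")]    # first two rows, two chosen columns
```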
For an overview of the entire data set, the str function we introduced last session is very handy. For each column, str tells us what type of variable it is, as well as the first few values for the column.
str(vehicles)
## Classes 'tbl_df', 'tbl' and 'data.frame': 33442 obs. of 12 variables:
## $ id : int 27550 28426 27549 28425 1032 1033 3347 13309 13310 13311 ...
## $ make : chr "AM General" "AM General" "AM General" "AM General" ...
## $ model: chr "DJ Po Vehicle 2WD" "DJ Po Vehicle 2WD" "FJ8c Post Office" "FJ8c Post Office" ...
## $ year : int 1984 1984 1984 1984 1985 1985 1987 1997 1997 1997 ...
## $ class: chr "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" ...
## $ trans: chr "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" ...
## $ drive: chr "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" ...
## $ cyl : int 4 4 6 6 4 6 6 4 4 6 ...
## $ displ: num 2.5 2.5 4.2 4.2 2.5 4.2 3.8 2.2 2.2 3 ...
## $ fuel : chr "Regular" "Regular" "Regular" "Regular" ...
## $ hwy : int 17 17 13 13 17 13 21 26 28 26 ...
## $ cty : int 18 18 13 13 16 13 14 20 22 18 ...
The summary function gives us some useful statistics for each variable:
summary(vehicles)
## id make model year
## Min. : 1 Length:33442 Length:33442 Min. :1984
## 1st Qu.: 8361 Class :character Class :character 1st Qu.:1991
## Median :16724 Mode :character Mode :character Median :1999
## Mean :17038 Mean :1999
## 3rd Qu.:25265 3rd Qu.:2008
## Max. :34932 Max. :2015
##
## class trans drive cyl
## Length:33442 Length:33442 Length:33442 Min. : 2.000
## Class :character Class :character Class :character 1st Qu.: 4.000
## Mode :character Mode :character Mode :character Median : 6.000
## Mean : 5.772
## 3rd Qu.: 6.000
## Max. :16.000
## NA's :58
## displ fuel hwy cty
## Min. :0.000 Length:33442 Min. : 9.00 Min. : 6.00
## 1st Qu.:2.300 Class :character 1st Qu.: 19.00 1st Qu.: 15.00
## Median :3.000 Mode :character Median : 23.00 Median : 17.00
## Mean :3.353 Mean : 23.55 Mean : 17.49
## 3rd Qu.:4.300 3rd Qu.: 27.00 3rd Qu.: 20.00
## Max. :8.400 Max. :109.00 Max. :138.00
## NA's :57
We can also do summaries on just one column:
summary(vehicles$hwy)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 19.00 23.00 23.55 27.00 109.00
For just the mean or median, use the mean and median functions on the column of interest:
mean(vehicles$hwy)
## [1] 23.55128
median(vehicles$hwy)
## [1] 23
The sd() and var() functions compute the standard deviation and variance of a vector for us:
sd(vehicles$hwy)
## [1] 6.211417
var(vehicles$hwy)
## [1] 38.5817
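As a quick check of how the two are related, the sketch below uses a short made-up vector (hwy_toy is an illustrative name) to confirm that the variance is the square of the standard deviation, and that var() divides by n - 1 (the sample variance).

```r
# A short made-up vector, for illustration only
hwy_toy <- c(17, 13, 26, 28)

sd(hwy_toy)^2   # squaring the standard deviation...
var(hwy_toy)    # ...gives the variance

# all.equal() compares numbers while allowing for tiny rounding differences
all.equal(sd(hwy_toy)^2, var(hwy_toy))

# var() uses the sample formula, dividing by n - 1
n <- length(hwy_toy)
manual_var <- sum((hwy_toy - mean(hwy_toy))^2) / (n - 1)
all.equal(manual_var, var(hwy_toy))
```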
Note that the default types for the variables don’t always make sense. For example, does it make sense to take the mean of id numbers? To change the type of a column, use the as.x function (where x is the type you want to change to):
vehicles$id <- as.character(vehicles$id)
str(vehicles)
## Classes 'tbl_df', 'tbl' and 'data.frame': 33442 obs. of 12 variables:
## $ id : chr "27550" "28426" "27549" "28425" ...
## $ make : chr "AM General" "AM General" "AM General" "AM General" ...
## $ model: chr "DJ Po Vehicle 2WD" "DJ Po Vehicle 2WD" "FJ8c Post Office" "FJ8c Post Office" ...
## $ year : int 1984 1984 1984 1984 1985 1985 1987 1997 1997 1997 ...
## $ class: chr "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" ...
## $ trans: chr "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" ...
## $ drive: chr "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" ...
## $ cyl : int 4 4 6 6 4 6 6 4 4 6 ...
## $ displ: num 2.5 2.5 4.2 4.2 2.5 4.2 3.8 2.2 2.2 3 ...
## $ fuel : chr "Regular" "Regular" "Regular" "Regular" ...
## $ hwy : int 17 17 13 13 17 13 21 26 28 26 ...
## $ cty : int 18 18 13 13 16 13 14 20 22 18 ...
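The as.x family works on plain vectors too. The sketch below uses made-up values (ids, id_chr and displ_chr are illustrative names) to show a round trip between types, and what happens when a value cannot be converted.

```r
# Made-up id numbers, for illustration only
ids <- c(27550L, 28426L, 1032L)

id_chr <- as.character(ids)  # integers -> character strings
class(id_chr)
as.integer(id_chr)           # and back to integers

# Values that cannot be converted become NA (R also issues a warning,
# which is silenced here with suppressWarnings)
displ_chr <- c("2.5", "4.2", "n/a")
suppressWarnings(as.numeric(displ_chr))
```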
Look at the output of summary(vehicles) again. Note that for all the character variables, summary() doesn’t give us much useful information about them. One way to get information on character variables is to use the table() function:
table(vehicles$drive)
##
## 2-Wheel Drive 4-Wheel Drive
## 507 699
## 4-Wheel or All-Wheel Drive All-Wheel Drive
## 6647 1267
## Front-Wheel Drive Part-time 4-Wheel Drive
## 12233 96
## Rear-Wheel Drive
## 11993
Another way we can get information on character variables is by converting them to factors. Factors represent categorical variables: i.e. values fall into one of several categories (e.g. gender, age group). Categories can be unordered (e.g. gender, we call them nominal variables), or ordered (e.g. age group, we call them ordinal variables).
We can make a character variable into a factor variable by using factor(). Notice now that summary() gives more useful information. (By default, factor variables are nominal variables.)
vehicles$drive <- factor(vehicles$drive)
summary(vehicles$drive)
## 2-Wheel Drive 4-Wheel Drive
## 507 699
## 4-Wheel or All-Wheel Drive All-Wheel Drive
## 6647 1267
## Front-Wheel Drive Part-time 4-Wheel Drive
## 12233 96
## Rear-Wheel Drive
## 11993
Let’s look at the internal structure of the factor variable:
str(vehicles$drive)
## Factor w/ 7 levels "2-Wheel Drive",..: 1 1 1 1 7 7 7 5 5 5 ...
Notice that the words (“2-Wheel Drive”, etc.) have been changed into numbers! That’s because R assigns each category a number. We can get a sense of this assignment by calling levels(), which shows us the “levels”, or categories, for this variable:
levels(vehicles$drive)
## [1] "2-Wheel Drive" "4-Wheel Drive"
## [3] "4-Wheel or All-Wheel Drive" "All-Wheel Drive"
## [5] "Front-Wheel Drive" "Part-time 4-Wheel Drive"
## [7] "Rear-Wheel Drive"
So 2-Wheel Drives are labeled 1, and so on. By default, R assigns this internal labeling by alphabetical order. This internal labeling is usually not a concern to us. See the optional material section for more details.
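To see the internal numbers directly, we can convert a factor with as.integer(). The sketch below uses a short made-up vector (drive_toy and drive_f are illustrative names) rather than the full column.

```r
# A short made-up character vector, for illustration only
drive_toy <- c("Rear-Wheel Drive", "2-Wheel Drive",
               "Front-Wheel Drive", "2-Wheel Drive")
drive_f <- factor(drive_toy)

levels(drive_f)       # the categories, in sorted order
as.integer(drive_f)   # the internal number stored for each element

# Looking up each number in the levels recovers the original words
levels(drive_f)[as.integer(drive_f)]
```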
Let’s compute the mean number of cylinders in our dataset:
mean(vehicles$cyl)
## [1] NA
Hmm, we get an NA? What’s happening? If we look through the cyl column, we’ll find that some of the entries are NA. Look at the documentation for the mean function and you’ll see that there is an na.rm option, with default value FALSE. This means that by default, mean will not remove any NAs that it sees, and will return NA if any one of the elements is NA.
We can get the mean as follows:
mean(vehicles$cyl, na.rm = TRUE)
## [1] 5.771867
Working with NAs can be tricky sometimes because they don’t always show up. For example, the output of table doesn’t show you the NAs, which could mislead you into thinking that there are no NAs in the column:
table(vehicles$cyl)
##
## 2 3 4 5 6 8 10 12 16
## 45 182 12381 718 11885 7550 138 478 7
The summary function does tell us though if there are NAs in a column:
summary(vehicles$cyl)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.000 4.000 6.000 5.772 6.000 16.000 58
To test if something is an NA or not, use the is.na function.
is.na(NA)
## [1] TRUE
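Because is.na works element by element on a vector, it can be used to count missing values. The sketch below uses a short made-up vector (cyl_toy is an illustrative name) and also shows the useNA argument of table, which makes the NAs visible in the counts.

```r
# A short made-up vector with two missing values, for illustration only
cyl_toy <- c(4, 6, NA, 8, NA, 4)

is.na(cyl_toy)        # TRUE wherever the value is missing
sum(is.na(cyl_toy))   # TRUE counts as 1, so this is the number of NAs

mean(cyl_toy, na.rm = TRUE)  # mean of the non-missing values

table(cyl_toy)                   # NAs are silently dropped
table(cyl_toy, useNA = "ifany")  # NAs get their own column
```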
What if we just want to look at observations which have more than 8 cylinders? To do that, we first need to know another way of extracting elements from a vector. Consider the vector below:
vec <- 1:3
To extract a group of elements from vec, we previously used square bracket notation, with a vector of indices that we wanted to extract:
vec[c(1,2)]
## [1] 1 2
Another way to extract elements is by putting a logical vector of the same length in the square brackets. R will then extract those elements which correspond to TRUE. For example, the code below extracts the first and third elements:
vec[c(TRUE, FALSE, TRUE)]
## [1] 1 3
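In practice we rarely type the logical vector by hand. A comparison such as vec > 1 produces one for us, as the short sketch below shows (keep is an illustrative name).

```r
vec <- 1:3

keep <- vec > 1   # a comparison returns one TRUE/FALSE per element
keep

vec[keep]         # keep only the elements where the comparison was TRUE
vec[vec > 1]      # the same thing in one step
```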
To extract all the observations with more than 8 cylinders, we can do this:
df <- vehicles[vehicles$cyl > 8, ]
table(df$cyl)
##
## 10 12 16
## 138 478 7
To extract observations with exactly 8 cylinders (notice the double equal sign):
df <- vehicles[vehicles$cyl == 8, ]
table(df$cyl)
##
## 8
## 7550
To extract observations such that the number of cylinders is not 8:
df <- vehicles[vehicles$cyl != 8, ]
table(df$cyl)
##
## 2 3 4 5 6 10 12 16
## 45 182 12381 718 11885 138 478 7
This is the “old” way of filtering datasets. (Next week, we’ll talk about a newer way to do filtering and other data transformations.)
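One caveat with the comparisons above: when the column contains NAs, the comparison itself returns NA for those rows, and subsetting with an NA gives back a row filled with NAs. Since table hides NAs, such rows are easy to miss. The sketch below shows this on a made-up data frame (toy is an illustrative name) and two possible ways around it. It is an illustration on a plain data frame, not a statement about exactly what df contains above.

```r
# A small made-up data frame with one missing value, for illustration only
toy <- data.frame(
  model = c("a", "b", "c", "d"),
  cyl   = c(4, 12, NA, 8),
  stringsAsFactors = FALSE
)

toy$cyl > 8            # FALSE TRUE NA FALSE: the NA carries through

toy[toy$cyl > 8, ]     # two rows: model "b" plus a row of NAs

# which() returns only the positions that are TRUE, dropping the NA
toy[which(toy$cyl > 8), ]

# Alternatively, exclude the NAs explicitly in the condition
toy[!is.na(toy$cyl) & toy$cyl > 8, ]
```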
Instead of just the first or last few rows, we may want to view a random sample of rows from the data frame. We can do this by composing functions that we already know with sample():
vehicles[sample(nrow(vehicles), 5), ]
## id make model year class
## 10392 14073 Eagle Talon 1998 Subcompact Cars
## 16719 9932 Hyundai Elantra 1993 Compact Cars
## 25881 362 Plymouth Horizon 1985 Compact Cars
## 11913 33187 Ford F150 Pickup 2WD 2013 Standard Pickup Trucks 2WD
## 4923 8563 Chevrolet Corvette 1992 Two Seaters
## trans drive cyl displ fuel hwy cty
## 10392 Manual 5-spd Front-Wheel Drive 4 2.0 Regular 30 20
## 16719 Automatic 4-spd Front-Wheel Drive 4 1.8 Regular 26 20
## 25881 Manual 5-spd Front-Wheel Drive 4 2.2 Regular 33 22
## 11913 Automatic 6-spd Rear-Wheel Drive 6 3.5 Regular 22 16
## 4923 Manual 6-spd Rear-Wheel Drive 8 5.7 Premium 23 15
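Because the rows are drawn at random, running the command again will give different rows. If we want the same sample each time (for example, so that others can reproduce it), we can fix the random seed with set.seed() first. The sketch below uses the built-in mtcars data frame so that it runs without any extra packages; the seed value 42 is arbitrary.

```r
set.seed(42)                      # fix the random number generator
first <- sample(nrow(mtcars), 5)  # 5 random row numbers

set.seed(42)                      # same seed again...
second <- sample(nrow(mtcars), 5)

identical(first, second)          # ...gives the same row numbers

mtcars[first, ]                   # the sampled rows
```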
R doesn’t have a built-in function to compute the statistical mode (the built-in mode() function returns an object’s storage mode instead). We can either write our own function (a number of people have done so, and a web search will turn up examples), or we can use other functions to work out what the mode is.
First, the table function tells us how many times each value appeared in the column:
table(vehicles$hwy)
##
## 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## 13 66 62 275 295 453 847 1257 2094 1547 1605 2314 1400 2672 2383
## 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
## 2788 1944 2712 1558 1448 1371 846 799 528 515 358 313 205 125 106
## 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
## 125 79 56 46 20 52 55 9 10 8 14 2 4 7 1
## 54 58 59 60 61 62 64 65 68 69 74 79 90 92 93
## 3 4 2 1 1 2 3 2 2 2 3 2 3 2 4
## 96 97 99 101 102 105 108 109
## 2 2 6 2 1 3 2 1
To find out which number appeared most often, we have to visually scan the whole table. We could sort the table to help us:
sort(table(vehicles$hwy))
##
## 53 60 61 102 109 50 59 62 65 68 69 79 92 96 97
## 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
## 101 108 54 64 74 90 105 51 58 93 99 52 48 46 47
## 2 2 3 3 3 3 3 4 4 4 6 7 8 9 10
## 9 49 43 42 44 45 41 11 10 40 38 37 39 36 12
## 13 14 20 46 52 55 56 62 66 79 106 125 125 205 275
## 13 35 34 14 33 32 31 30 15 16 29 21 28 18 27
## 295 313 358 453 515 528 799 846 847 1257 1371 1400 1448 1547 1558
## 19 25 17 20 23 22 26 24
## 1605 1944 2094 2314 2383 2672 2712 2788
The mode is the last entry (24, appearing 2788 times). To have the mode appear in front, add a decreasing = TRUE argument to the function call:
sort(table(vehicles$hwy), decreasing = TRUE)
##
## 24 26 22 23 20 17 25 19 27 18 28 21 29 16 15
## 2788 2712 2672 2383 2314 2094 1944 1605 1558 1547 1448 1400 1371 1257 847
## 30 31 32 33 14 34 35 13 12 36 37 39 38 40 10
## 846 799 528 515 453 358 313 295 275 205 125 125 106 79 66
## 11 41 45 44 42 43 49 9 47 46 48 52 99 51 58
## 62 56 55 52 46 20 14 13 10 9 8 7 6 4 4
## 93 54 64 74 90 105 50 59 62 65 68 69 79 92 96
## 4 3 3 3 3 3 2 2 2 2 2 2 2 2 2
## 97 101 108 53 60 61 102 109
## 2 2 2 1 1 1 1 1
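The steps above can be wrapped into a small function of our own. The sketch below is one possible version, not a standard R function: stat_mode is a hypothetical name, and it returns the value(s) as character strings because that is how table stores its labels. If several values tie for most frequent, it returns all of them.

```r
# Hypothetical helper: the most frequent value(s) in x, as character strings
stat_mode <- function(x) {
  counts <- table(x)                    # how often each value appears
  names(counts)[counts == max(counts)]  # label(s) with the highest count
}

stat_mode(c(24, 26, 24, 22))      # "24"
stat_mode(c("a", "b", "a", "b"))  # a tie: both "a" and "b"

# The one-line version of the sorted-table approach used above
names(sort(table(c(24, 26, 24, 22)), decreasing = TRUE))[1]
```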
By default, when we make a variable a factor, R assigns an internal labeling by alphabetical order. This usually doesn’t concern us. One instance where we might want to have more control over the ordering is when we plot the data: for a bar plot, the category labeled 1 goes on the left-most end, followed by 2, etc.
barplot(table(vehicles$drive))
If we want to, we can set the order ourselves by specifying a levels argument. Let’s flip the labeling:
vehicles$drive <- factor(vehicles$drive,
levels = sort(unique(vehicles$drive), decreasing = TRUE))
levels(vehicles$drive)
## [1] "Rear-Wheel Drive" "Part-time 4-Wheel Drive"
## [3] "Front-Wheel Drive" "All-Wheel Drive"
## [5] "4-Wheel or All-Wheel Drive" "4-Wheel Drive"
## [7] "2-Wheel Drive"
Note how the barplot is now “flipped”:
barplot(table(vehicles$drive))
For ordinal variables, we need to add an ordered = TRUE argument to factor():
vehicles$drive <- as.character(vehicles$drive)
vehicles$drive <- factor(vehicles$drive, ordered = TRUE)
str(vehicles$drive)
## Ord.factor w/ 7 levels "2-Wheel Drive"<..: 1 1 1 1 7 7 7 5 5 5 ...
levels(vehicles$drive)
## [1] "2-Wheel Drive" "4-Wheel Drive"
## [3] "4-Wheel or All-Wheel Drive" "All-Wheel Drive"
## [5] "Front-Wheel Drive" "Part-time 4-Wheel Drive"
## [7] "Rear-Wheel Drive"
head(vehicles$drive)
## [1] 2-Wheel Drive 2-Wheel Drive 2-Wheel Drive 2-Wheel Drive
## [5] Rear-Wheel Drive Rear-Wheel Drive
## 7 Levels: 2-Wheel Drive < ... < Rear-Wheel Drive
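In the example above the levels are still in alphabetical order, which is not a meaningful ranking for drive types. For a variable with a natural ranking, we would combine ordered = TRUE with an explicit levels argument. The sketch below uses made-up size categories (sizes and size_f are illustrative names) and shows that ordered factors support comparisons.

```r
# Made-up ordinal data, for illustration only
sizes <- c("medium", "small", "large", "small")

size_f <- factor(sizes,
                 levels = c("small", "medium", "large"),  # lowest to highest
                 ordered = TRUE)

size_f              # printed with small < medium < large
size_f < "large"    # comparisons follow the level order
max(size_f)         # the highest category present
```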
This section is for documentation purposes: By displaying my session info, others who read this document will know what the system set-up was when I ran the commands above.
sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] fueleconomy_0.1
##
## loaded via a namespace (and not attached):
## [1] compiler_3.5.1 backports_1.1.2 magrittr_1.5 rprojroot_1.3-2
## [5] tools_3.5.1 htmltools_0.3.6 yaml_2.1.19 Rcpp_0.12.17
## [9] stringi_1.2.3 rmarkdown_1.10 knitr_1.20 stringr_1.3.1
## [13] digest_0.6.15 evaluate_0.10.1